Crime prediction is a complex problem because there are many random human factors involved. This report is about retracing historical crime events in Chicago and creating a model that attempts to make predictions about the number of crimes in the community. The predictions are based on multiple open data sources that were combined before. The primary data source is crime incident data from the Chicago Police Department for the period 2015-2019. In terms of features, I currently include socioeconomic features, temporal features (e.g., hour, day and month), and nearest neighbor features. The prediction model is a prediction of the crime rate for each community.
Before diving into prediction, I first need to collect and clean the dataset, and then preprocess the data to ensure that it is in a usable format.
import folium
import altair as alt
import numpy as np
import pandas as pd
import geopandas as gpd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import hvplot.pandas
import esri2gpd
import carto2gpd
import seaborn as sns
import osmnx as ox
import datetime
from IPython.display import IFrame
from matplotlib import pyplot as plt
from shapely.geometry import Point
# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# perform the K-Means fit
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# Pipelines
from sklearn.pipeline import make_pipeline
# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
# Neighbors
from sklearn.neighbors import NearestNeighbors
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/_compat.py:111: UserWarning: The Shapely GEOS version (3.11.0-CAPI-1.17.0) is incompatible with the GEOS version PyGEOS was compiled with (3.10.4-CAPI-1.16.2). Conversions between both will be slow. warnings.warn(
Here, I collect crime data from 2015 to 2019. Some data from recent years were not selected because key features were missing. The dataset is in a tabular form and includes chronological, geographical and text data, which extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The dataset contains more than 1,000,000 records/rows of data.
# import 5-year crime data from chicago data portal
pd.set_option('display.max_columns', None)
url = 'https://data.cityofchicago.org/resource/kf95-mnd6.json?$limit=999999'
crime2016 = pd.read_json(url)
url = 'https://data.cityofchicago.org/resource/d62x-nvdr.json?$limit=999999'
crime2017 = pd.read_json(url)
url = 'https://data.cityofchicago.org/resource/3i3m-jwuy.json?$limit=999999'
crime2018 = pd.read_json(url)
url = 'https://data.cityofchicago.org/resource/w98m-zvie.json?$limit=999999'
crime2019 = pd.read_json(url)
url = 'https://data.cityofchicago.org/resource/vwwp-7yr9.json?$limit=999999'
crime2015 = pd.read_json(url)
frames = [crime2015, crime2016, crime2017, crime2018, crime2019]
# combine each year together and remove Na
allcrime = pd.concat(frames).dropna()
cols = ["case_number",
"date",
"location",
"district",
"block",
"primary_type",
"description",
"location_description",
"community_area",
"latitude",
"longitude",
"ward",
"year",
"domestic",
"beat",
"arrest"
]
# Trim to these columns
allcrime = allcrime[cols]
# extract the year from date
allcrime["date"] = pd.to_datetime(allcrime["date"])
allcrime['year'] = pd.DatetimeIndex(allcrime['date']).year
allcrime['month'] = pd.DatetimeIndex(allcrime['date']).month
allcrime['hour'] = pd.DatetimeIndex(allcrime['date']).hour
allcrime['day'] = pd.DatetimeIndex(allcrime['date']).day
allcrime["dow"] = allcrime["date"].apply(lambda x: x.strftime('%A'))
allcrime['month'] = allcrime['month'].astype(str)
allcrime['year'] = allcrime['year'].astype(str)
allcrime["year_month"] = allcrime['year'].str.cat(allcrime['month'].str.zfill(2), sep="-")
# change data type
allcrime['community_area'] = allcrime['community_area'].astype(int)
allcrime['district'] = allcrime['district'].astype(int)
allcrime['ward'] = allcrime['ward'].astype(int)
allcrime['beat'] = allcrime['beat'].astype(int)
# creating a geometry column
geometry = [Point(xy) for xy in zip(allcrime['longitude'], allcrime['latitude'])]
# coordinate reference system : WGS84
crs = {'init': 'epsg:4326'}
# creating a Geographic data frame
allcrime = gpd.GeoDataFrame(allcrime, crs=crs, geometry=geometry)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/pyproj/crs/crs.py:130: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6 in_crs_string = _prepare_from_proj_string(in_crs_string)
# add the count field
allcrime["count"] = 1
# data format transform
allcrime['primary_type'] = pd.Categorical(allcrime['primary_type'])
allcrime['description'] = pd.Categorical(allcrime['description'])
allcrime['location_description'] = pd.Categorical(allcrime['location_description'])
allcrime.head()
| case_number | date | location | district | block | primary_type | description | location_description | community_area | latitude | longitude | ward | year | domestic | beat | arrest | month | hour | day | dow | year_month | geometry | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HZ100370 | 2015-12-31 23:59:00 | {'latitude': '41.757366519', 'human_address': ... | 6 | 075XX S EMERALD AVE | CRIMINAL DAMAGE | TO VEHICLE | STREET | 68 | 41.757367 | -87.642993 | 17 | 2015 | False | 621 | False | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.64299 41.75737) | 1 |
| 2 | HZ100006 | 2015-12-31 23:55:00 | {'latitude': '41.751270452', 'human_address': ... | 4 | 079XX S STONY ISLAND AVE | BATTERY | AGGRAVATED: OTHER DANG WEAPON | STREET | 45 | 41.751270 | -87.585822 | 8 | 2015 | False | 411 | False | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.58582 41.75127) | 1 |
| 3 | HZ100010 | 2015-12-31 23:50:00 | {'latitude': '42.016804165', 'human_address': ... | 24 | 024XX W FARGO AVE | THEFT | $500 AND UNDER | APARTMENT | 2 | 42.016804 | -87.690709 | 50 | 2015 | False | 2411 | False | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.69071 42.01680) | 1 |
| 4 | HZ100002 | 2015-12-31 23:50:00 | {'latitude': '41.949837364', 'human_address': ... | 19 | 037XX N CLARK ST | BATTERY | SIMPLE | SIDEWALK | 6 | 41.949837 | -87.658635 | 44 | 2015 | False | 1923 | True | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.65864 41.94984) | 1 |
| 5 | HZ102701 | 2015-12-31 23:45:00 | {'latitude': '41.910469677', 'human_address': ... | 25 | 050XX W CONCORD PL | CRIMINAL DAMAGE | TO PROPERTY | APARTMENT | 25 | 41.910470 | -87.751597 | 37 | 2015 | False | 2533 | False | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.75160 41.91047) | 1 |
In this project, the crime dataset from 2015 to 2019 owns 1303648 rows and 23 columns, and also includes 34 types of crimes, which are multiple categorical values.
print('Dataset shape ->', allcrime.shape)
Dataset shape -> (1303648, 23)
crime = allcrime["primary_type"].unique()
list(crime)
['CRIMINAL DAMAGE', 'BATTERY', 'THEFT', 'OTHER OFFENSE', 'ROBBERY', 'MOTOR VEHICLE THEFT', 'WEAPONS VIOLATION', 'CRIMINAL SEXUAL ASSAULT', 'ASSAULT', 'BURGLARY', 'DECEPTIVE PRACTICE', 'CRIM SEXUAL ASSAULT', 'NARCOTICS', 'CRIMINAL TRESPASS', 'OFFENSE INVOLVING CHILDREN', 'INTERFERENCE WITH PUBLIC OFFICER', 'SEX OFFENSE', 'ARSON', 'PUBLIC PEACE VIOLATION', 'PROSTITUTION', 'LIQUOR LAW VIOLATION', 'HOMICIDE', 'INTIMIDATION', 'KIDNAPPING', 'OBSCENITY', 'STALKING', 'PUBLIC INDECENCY', 'NON-CRIMINAL', 'NON - CRIMINAL', 'CONCEALED CARRY LICENSE VIOLATION', 'GAMBLING', 'HUMAN TRAFFICKING', 'OTHER NARCOTIC VIOLATION', 'NON-CRIMINAL (SUBJECT SPECIFIED)']
The community dataset is from Chicago Data Portal. There are 77 communities in the Chicago City.
boundary = gpd.read_file('Boundaries - Community Areas (current).geojson')
boundary.head()
| community | area | shape_area | perimeter | area_num_1 | area_numbe | comarea_id | comarea | shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DOUGLAS | 0 | 46004621.1581 | 0 | 35 | 35 | 0 | 0 | 31027.0545098 | MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ... |
| 1 | OAKLAND | 0 | 16913961.0408 | 0 | 36 | 36 | 0 | 0 | 19565.5061533 | MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ... |
| 2 | FULLER PARK | 0 | 19916704.8692 | 0 | 37 | 37 | 0 | 0 | 25339.0897503 | MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ... |
| 3 | GRAND BOULEVARD | 0 | 48492503.1554 | 0 | 38 | 38 | 0 | 0 | 28196.8371573 | MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ... |
| 4 | KENWOOD | 0 | 29071741.9283 | 0 | 39 | 39 | 0 | 0 | 23325.1679062 | MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ... |
community = boundary[["community", "area_numbe", "geometry"]].rename(columns={'area_numbe': 'community_area'}).drop(columns='geometry')
community['community_area'] = community['community_area'].astype(int)
community.head()
| community | community_area | |
|---|---|---|
| 0 | DOUGLAS | 35 |
| 1 | OAKLAND | 36 |
| 2 | FULLER PARK | 37 |
| 3 | GRAND BOULEVARD | 38 |
| 4 | KENWOOD | 39 |
allcrime = allcrime.merge(community, on="community_area")
allcrime.head()
| case_number | date | location | district | block | primary_type | description | location_description | community_area | latitude | longitude | ward | year | domestic | beat | arrest | month | hour | day | dow | year_month | geometry | count | community | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HZ100370 | 2015-12-31 23:59:00 | {'latitude': '41.757366519', 'human_address': ... | 6 | 075XX S EMERALD AVE | CRIMINAL DAMAGE | TO VEHICLE | STREET | 68 | 41.757367 | -87.642993 | 17 | 2015 | False | 621 | False | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.64299 41.75737) | 1 | ENGLEWOOD |
| 1 | HZ100501 | 2015-12-31 22:00:00 | {'latitude': '41.766639191', 'human_address': ... | 7 | 070XX S PARNELL AVE | CRIMINAL DAMAGE | TO VEHICLE | STREET | 68 | 41.766639 | -87.638388 | 6 | 2015 | False | 732 | False | 12 | 22 | 31 | Thursday | 2015-12 | POINT (-87.63839 41.76664) | 1 | ENGLEWOOD |
| 2 | HZ100572 | 2015-12-31 21:00:00 | {'latitude': '41.759381999', 'human_address': ... | 7 | 074XX S GREEN ST | MOTOR VEHICLE THEFT | TRUCK, BUS, MOTOR HOME | STREET | 68 | 41.759382 | -87.645475 | 17 | 2015 | False | 733 | False | 12 | 21 | 31 | Thursday | 2015-12 | POINT (-87.64547 41.75938) | 1 | ENGLEWOOD |
| 3 | HY556424 | 2015-12-31 17:55:00 | {'latitude': '41.77529175', 'human_address': '... | 7 | 005XX W 65TH PL | ROBBERY | ARMED: HANDGUN | ALLEY | 68 | 41.775292 | -87.637448 | 20 | 2015 | False | 722 | False | 12 | 17 | 31 | Thursday | 2015-12 | POINT (-87.63745 41.77529) | 1 | ENGLEWOOD |
| 4 | HY556343 | 2015-12-31 16:49:00 | {'latitude': '41.76228091', 'human_address': '... | 7 | 072XX S ABERDEEN ST | BATTERY | AGGRAVATED: HANDGUN | DRIVEWAY - RESIDENTIAL | 68 | 41.762281 | -87.651616 | 6 | 2015 | False | 733 | False | 12 | 16 | 31 | Thursday | 2015-12 | POINT (-87.65162 41.76228) | 1 | ENGLEWOOD |
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” by Chicago community area, for the years 2008 – 2012.
The indicators are the percent of occupied housing units with more than one person per room (i.e., crowded housing); the percent of households living below the federal poverty level; the percent of persons in the labor force over the age of 16 years that are unemployed; the percent of persons over the age of 25 years without a high school diploma; the percent of the population under 18 or over 64 years of age (i.e., dependency); and per capita income.
url = "https://data.cityofchicago.org/resource/kn9c-c2s2.csv"
census = pd.read_csv(url)
census = census.rename(columns={'ca': 'community_area'})
census['community_area'] = census['community_area'].astype("Int64")
allcrime['community_area'] = allcrime['community_area'].astype("Int64")
allcrime = pd.merge(allcrime, census,on="community_area")
allcrime.head()
| case_number | date | location | district | block | primary_type | description | location_description | community_area | latitude | longitude | ward | year | domestic | beat | arrest | month | hour | day | dow | year_month | geometry | count | community | community_area_name | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HZ100370 | 2015-12-31 23:59:00 | {'latitude': '41.757366519', 'human_address': ... | 6 | 075XX S EMERALD AVE | CRIMINAL DAMAGE | TO VEHICLE | STREET | 68 | 41.757367 | -87.642993 | 17 | 2015 | False | 621 | False | 12 | 23 | 31 | Thursday | 2015-12 | POINT (-87.64299 41.75737) | 1 | ENGLEWOOD | Englewood | 3.8 | 46.6 | 28.0 | 28.5 | 42.5 | 11888 | 94.0 |
| 1 | HZ100501 | 2015-12-31 22:00:00 | {'latitude': '41.766639191', 'human_address': ... | 7 | 070XX S PARNELL AVE | CRIMINAL DAMAGE | TO VEHICLE | STREET | 68 | 41.766639 | -87.638388 | 6 | 2015 | False | 732 | False | 12 | 22 | 31 | Thursday | 2015-12 | POINT (-87.63839 41.76664) | 1 | ENGLEWOOD | Englewood | 3.8 | 46.6 | 28.0 | 28.5 | 42.5 | 11888 | 94.0 |
| 2 | HZ100572 | 2015-12-31 21:00:00 | {'latitude': '41.759381999', 'human_address': ... | 7 | 074XX S GREEN ST | MOTOR VEHICLE THEFT | TRUCK, BUS, MOTOR HOME | STREET | 68 | 41.759382 | -87.645475 | 17 | 2015 | False | 733 | False | 12 | 21 | 31 | Thursday | 2015-12 | POINT (-87.64547 41.75938) | 1 | ENGLEWOOD | Englewood | 3.8 | 46.6 | 28.0 | 28.5 | 42.5 | 11888 | 94.0 |
| 3 | HY556424 | 2015-12-31 17:55:00 | {'latitude': '41.77529175', 'human_address': '... | 7 | 005XX W 65TH PL | ROBBERY | ARMED: HANDGUN | ALLEY | 68 | 41.775292 | -87.637448 | 20 | 2015 | False | 722 | False | 12 | 17 | 31 | Thursday | 2015-12 | POINT (-87.63745 41.77529) | 1 | ENGLEWOOD | Englewood | 3.8 | 46.6 | 28.0 | 28.5 | 42.5 | 11888 | 94.0 |
| 4 | HY556343 | 2015-12-31 16:49:00 | {'latitude': '41.76228091', 'human_address': '... | 7 | 072XX S ABERDEEN ST | BATTERY | AGGRAVATED: HANDGUN | DRIVEWAY - RESIDENTIAL | 68 | 41.762281 | -87.651616 | 6 | 2015 | False | 733 | False | 12 | 16 | 31 | Thursday | 2015-12 | POINT (-87.65162 41.76228) | 1 | ENGLEWOOD | Englewood | 3.8 | 46.6 | 28.0 | 28.5 | 42.5 | 11888 | 94.0 |
Finally, the data will be merged together. The above datasets are added to the original dataset.
# group by community
count=allcrime.groupby(["community"])["count"].sum()
# add arrest rate
arrest=allcrime.groupby(['community'])['arrest'].mean()
# add arrest domestic violence rate
domestic=allcrime.groupby(['community'])['domestic'].mean()
#types = allcrime.groupby(["community","primary_type"])["count"].sum().reset_index()
#types = pd.DataFrame(types).rename(columns={'count': 'crime_count'})
# transform dataframe from long to wide, add month column
month = allcrime.groupby(["community","month"])["count"].sum().reset_index()
month = pd.DataFrame(month).rename(columns={'count': 'month_count'})
days = allcrime.groupby(["community","dow"])["count"].sum().reset_index()
days = pd.DataFrame(days).rename(columns={'count': 'week_count'})
weapon_select=allcrime.groupby(["description","community"])['count'].sum().reset_index()
weapon_select=pd.DataFrame(weapon_select)
knife=weapon_select[weapon_select["description"].str.contains('KNIFE')]
gun=weapon_select[weapon_select["description"].str.contains('GUN')]
weapon=weapon_select[weapon_select["description"].str.contains('WEAPON')]
withweapon=pd.concat([knife, gun, weapon]).rename(columns={'count': 'weapon_count'})
withweapon=withweapon.groupby(["community"])['weapon_count'].sum().reset_index()
# merge together and get final cluster
crime=pd.merge(count,arrest,on="community")
crime=pd.merge(crime,domestic,on="community")
#crime=pd.merge(crime,types,on="community")
crime=pd.merge(crime,month,on="community")
crime=pd.merge(crime,days,on="community")
crime=pd.merge(crime,withweapon,on="community")
crime=crime.dropna()
crime
| community | count | arrest | domestic | month | month_count | dow | week_count | weapon_count | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ALBANY PARK | 11908 | 0.130752 | 0.141250 | 1 | 934 | Friday | 1634 | 1116 |
| 1 | ALBANY PARK | 11908 | 0.130752 | 0.141250 | 1 | 934 | Monday | 1756 | 1116 |
| 2 | ALBANY PARK | 11908 | 0.130752 | 0.141250 | 1 | 934 | Saturday | 1673 | 1116 |
| 3 | ALBANY PARK | 11908 | 0.130752 | 0.141250 | 1 | 934 | Sunday | 1708 | 1116 |
| 4 | ALBANY PARK | 11908 | 0.130752 | 0.141250 | 1 | 934 | Thursday | 1704 | 1116 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6463 | WOODLAWN | 18007 | 0.218137 | 0.222691 | 9 | 1493 | Saturday | 2455 | 2206 |
| 6464 | WOODLAWN | 18007 | 0.218137 | 0.222691 | 9 | 1493 | Sunday | 2468 | 2206 |
| 6465 | WOODLAWN | 18007 | 0.218137 | 0.222691 | 9 | 1493 | Thursday | 2633 | 2206 |
| 6466 | WOODLAWN | 18007 | 0.218137 | 0.222691 | 9 | 1493 | Tuesday | 2651 | 2206 |
| 6467 | WOODLAWN | 18007 | 0.218137 | 0.222691 | 9 | 1493 | Wednesday | 2553 | 2206 |
6468 rows × 9 columns
crime_geo=boundary.merge(crime,on="community")
crime_geo=crime_geo.drop(columns=['area', 'shape_area','perimeter','area_num_1','comarea_id','comarea','shape_len']).rename(columns={'area_numbe': 'community_area'})
crime_geo=crime_geo.to_crs(3857)
crime_geo['community_area'] = crime_geo['community_area'].astype("Int64")
crime_geo=census.merge(crime_geo,on="community_area")
crime_geo
| community_area | community_area_name | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index | community | geometry | count | arrest | domestic | month | month_count | dow | week_count | weapon_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Friday | 2583 | 1309 |
| 1 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Monday | 2744 | 1309 |
| 2 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Saturday | 2580 | 1309 |
| 3 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Sunday | 2549 | 1309 |
| 4 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Thursday | 2522 | 1309 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6463 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Saturday | 1739 | 682 |
| 6464 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Sunday | 1591 | 682 |
| 6465 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Thursday | 1738 | 682 |
| 6466 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Tuesday | 1756 | 682 |
| 6467 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Wednesday | 1786 | 682 |
6468 rows × 19 columns
crime_geo=gpd.GeoDataFrame(crime_geo)
crime_geo=crime_geo.to_crs(3857)
crime_geo
| community_area | community_area_name | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index | community | geometry | count | arrest | domestic | month | month_count | dow | week_count | weapon_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Friday | 2583 | 1309 |
| 1 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Monday | 2744 | 1309 |
| 2 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Saturday | 2580 | 1309 |
| 3 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Sunday | 2549 | 1309 |
| 4 | 1 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | ROGERS PARK | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 18144 | 0.171627 | 0.140267 | 1 | 1378 | Thursday | 2522 | 1309 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6463 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Saturday | 1739 | 682 |
| 6464 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Sunday | 1591 | 682 |
| 6465 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Thursday | 1738 | 682 |
| 6466 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Tuesday | 1756 | 682 |
| 6467 | 77 | Edgewater | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | EDGEWATER | MULTIPOLYGON (((-9757660.529 5160704.746, -975... | 12266 | 0.159873 | 0.104598 | 9 | 1081 | Wednesday | 1786 | 682 |
6468 rows × 19 columns
In this section, I will explore the raw data. Based on crime records from previous years, this section is designed to show the basic information about crime in Chicago, such as the type of crime (e.g., wound, fatality etc.), the number of different types of crime, the location of crime, and the time series when the crime occurred. I will create some interactive charts by using altair, seaborn, numpy,plotly, andfolium function. The main point in this part is to indicate how significant location and time have effects on random crime.
# define basemap in folium
import folium
from folium.plugins import HeatMap
#define
def get_neighborhood_style(feature):
"""Return a style dict."""
return {"color": "#556158",
"weight": 1.2}
def get_highlighted_style(feature):
"""Return a style dict when highlighting a feature."""
return {"weight": 3, "color": "orange"}
def get_style(feature):
"""
Given an input GeoJSON feature, return a style dict.
Notes
-----
The color in the style dict is determined by the
"index_normalized" column in the input "feature".
"""
# Get the data value from the feature
value = feature['properties']['index_normalized']
# Evaluate the color map
# NOTE: value must between 0 and 1
rgb_color = cmap(value) # this is an RGB tuple
# Convert to hex string
color = mcolors.rgb2hex(rgb_color)
# Return the style dictionary
return {'weight': 3,
'color': color,
'fillColor': color,
"fillOpacity": 1}
community_count = allcrime.groupby(["community"])["count"].sum()
community_count = pd.DataFrame(community_count).reset_index().sort_values('count', ascending=False)
community_count.head()
| community | count | |
|---|---|---|
| 5 | AUSTIN | 77134 |
| 47 | NEAR NORTH SIDE | 56461 |
| 41 | LOOP | 47802 |
| 49 | NEAR WEST SIDE | 44469 |
| 52 | NORTH LAWNDALE | 43908 |
# STEP 1: Initialize the map
m = folium.Map(
location=[41.8, -87.7],
tiles='stamentoner',
zoom_start=10.5
)
title_html = '''
<h3 align="center" style="font-size:24px"><b>Choropleth of Crimes in Community, Chicago</b></h3>
'''
# STEP 2: Add the community GeoJson to the map
folium.GeoJson(
boundary,
style_function=get_neighborhood_style,
tooltip=folium.GeoJsonTooltip(['community'])
).add_to(m)
# STEP 3: Add the community count to the map
folium.Choropleth(
geo_data=boundary,
data=community_count,
columns=['community','count'],
key_on='feature.properties.community',
fill_color='YlOrBr',
fill_opacity=0.7,
line_color="transparent",
line_weight=0,
highlight=True
).add_to(m)
m.get_root().html.add_child(folium.Element(title_html))
m.save("folium_choropleth_community_crimes.html")
m
By folium, I'm able to create an interactive map to simply show the accumulated crime events in the community over 5 years. As shown on the map, the crime incidents were concentrated in several neighborhoods in the city and several neighborhoods near the southside.
crime_count = allcrime.groupby(["primary_type"])["count"].sum()
crime_count = pd.DataFrame(crime_count).reset_index().sort_values('count', ascending=True)
# Create the histogram
fig = px.bar(crime_count,
x="count",
y="primary_type",
color="count",
color_continuous_scale='thermal',
template="plotly_dark",
title="Crime Count in Chicago (2015 - 2019)")
fig.update_layout(
xaxis=dict(
title='Count',
tickfont_size=12
),
yaxis=dict(
title='Crimes Type',
titlefont_size=12,
tickfont_size=6,
)
)
fig.write_html("chart_crime_count.html")
# Show the plot
fig.show()
The top ten crimes in terms of crime records were theft,battery, criminal damage, assult, other offense, deceptive practice, narcotics, burglary, moto vehicle theft and robbery.
location_count = allcrime.groupby(["location_description"])["count"].sum()
location_count = pd.DataFrame(location_count).reset_index().sort_values('count', ascending=False)
location_count = location_count.iloc[0:20]
# Create the histogram
fig = px.bar(location_count,
x="count",
y="location_description",
color="count",
color_continuous_scale='thermal',
template="plotly_dark",
title="Top 20 Location of Crimes in Chicago (2015 - 2019)")
fig.update_layout(
xaxis=dict(
title='Count',
tickfont_size=12
),
yaxis=dict(
title='Location',
titlefont_size=12,
tickfont_size=6,
)
)
fig.write_html("chart_crime_location.html")
# Show the plot
fig.show()
community_crime_count = allcrime.groupby(["primary_type","community"])["count"].sum()
community_crime_count = pd.DataFrame(community_crime_count).reset_index().sort_values('count', ascending=True)
community_crime_count.head()
| primary_type | community | count | |
|---|---|---|---|
| 1644 | NON-CRIMINAL (SUBJECT SPECIFIED) | GARFIELD RIDGE | 0 |
| 1653 | NON-CRIMINAL (SUBJECT SPECIFIED) | KENWOOD | 0 |
| 1652 | NON-CRIMINAL (SUBJECT SPECIFIED) | JEFFERSON PARK | 0 |
| 1651 | NON-CRIMINAL (SUBJECT SPECIFIED) | IRVING PARK | 0 |
| 1650 | NON-CRIMINAL (SUBJECT SPECIFIED) | HYDE PARK | 0 |
fig = px.scatter(community_crime_count,
x="count",
y="community",
size="count",
color="primary_type",
template="plotly_dark",
title="Crimes Happened in Community (2015 - 2019)",
marginal_x="histogram")
fig.update_layout(
xaxis=dict(
title='Count',
tickfont_size=12
),
yaxis=dict(
title='Community',
titlefont_size=12,
tickfont_size=6,
)
)
fig.write_html("chart_crimes_community.html")
fig.show()
district_count=allcrime.groupby(["district","primary_type"])["count"].sum()
district_count=pd.DataFrame(district_count).reset_index()
district_count.head()
| district | primary_type | count | |
|---|---|---|---|
| 0 | 1 | ARSON | 22 |
| 1 | 1 | ASSAULT | 3546 |
| 2 | 1 | BATTERY | 7699 |
| 3 | 1 | BURGLARY | 913 |
| 4 | 1 | CONCEALED CARRY LICENSE VIOLATION | 10 |
colormap1 = alt.Scale(domain=[0,5,25,125,250, 625,1250, 3125,6250, 12500, 15625],
range=['#461e5c','#72369d','#9d4edd','#b67be6','#cea7ee','#ffcf70','#ffc243','#ffad03','#f56a00','#bf5028'],
type='sqrt')
heatmap_d = alt.Chart(district_count).mark_rect().encode(
alt.X('district:O', axis=alt.Axis(title="District", ticks=False)),
alt.Y('primary_type:O',axis=alt.Axis(title="Crime", ticks=True)),
alt.Color('count:Q',scale=colormap1),
tooltip=['district:O', 'count:Q']
).properties(
width=450,
height=500,
title={
"text": ["Crimes By District (2015-2019)"],
"subtitle": ["in Chicago"]
}
).interactive()
heatmap_d.save('heatmap_district.html')
heatmap_d
crime_annual=allcrime.groupby(["year_month","year","month"])["count"].sum()
crime_annual=pd.DataFrame(crime_annual).reset_index()
fig = px.bar(
crime_annual,
x="year_month",
y="count",
hover_data=["month", "count"],
color="count",
height=300,
color_continuous_scale=px.colors.sequential.thermal,
template="plotly_dark"
)
fig.update_layout(
title_text='Histogram of Crimes By Month (2015 - 2019)',
xaxis=dict(
title='Month',
tickfont_size=12
),
yaxis=dict(
title='Community',
titlefont_size=12,
tickfont_size=6,
)
)
fig.write_html("chart_histogram_of_crimes_by_month.html")
fig.show()
community_annual=allcrime.groupby(["year_month","year","month","community"])["count"].sum()
community_annual=pd.DataFrame(community_annual).reset_index()
community_annual
| year_month | year | month | community | count | |
|---|---|---|---|---|---|
| 0 | 2015-01 | 2015 | 1 | ALBANY PARK | 179 |
| 1 | 2015-01 | 2015 | 1 | ARCHER HEIGHTS | 85 |
| 2 | 2015-01 | 2015 | 1 | ARMOUR SQUARE | 89 |
| 3 | 2015-01 | 2015 | 1 | ASHBURN | 197 |
| 4 | 2015-01 | 2015 | 1 | AUBURN GRESHAM | 616 |
| ... | ... | ... | ... | ... | ... |
| 4615 | 2019-12 | 2019 | 12 | WEST LAWN | 146 |
| 4616 | 2019-12 | 2019 | 12 | WEST PULLMAN | 309 |
| 4617 | 2019-12 | 2019 | 12 | WEST RIDGE | 284 |
| 4618 | 2019-12 | 2019 | 12 | WEST TOWN | 558 |
| 4619 | 2019-12 | 2019 | 12 | WOODLAWN | 251 |
4620 rows × 5 columns
colormap2 = alt.Scale(domain=[0,20,40,60,80,100,200,400,800,1200],
range=['#461e5c','#72369d','#9d4edd','#b67be6','#cea7ee','#ffcf70','#ffc243','#ffad03','#f56a00','#bf5028'],
type='sqrt')
heatmap_t = alt.Chart(community_annual).mark_rect().encode(
alt.X('year_month:O', axis=alt.Axis(title="Month", ticks=False)),
alt.Y('community:O',axis=alt.Axis(title="Community", ticks=True)),
alt.Color('count:Q',scale=colormap2),
tooltip=['year:O', 'count:Q']
).properties(
width=650,
height=700,
title={
"text": ["Crimes By Month in Community (2015-2019)"],
"subtitle": ["in Chicago"]
}
).interactive()
heatmap_t.save('heatmap_time.html')
heatmap_t
hour_count=allcrime.groupby(["hour","primary_type"])["count"].sum()
hour_count=pd.DataFrame(hour_count).reset_index().sort_values('count', ascending=False)
hour_count.head()
| hour | primary_type | count | |
|---|---|---|---|
| 440 | 12 | THEFT | 20729 |
| 644 | 18 | THEFT | 19984 |
| 542 | 15 | THEFT | 19897 |
| 610 | 17 | THEFT | 19855 |
| 508 | 14 | THEFT | 19411 |
colormap3 = alt.Scale(domain=[0,5,25,125,250,625,1250,3125,6250, 12500, 15625],
range=['#461e5c','#72369d','#9d4edd','#b67be6','#cea7ee','#ffcf70','#ffc243','#ffad03','#f56a00','#bf5028'],
type='sqrt')
heatmap_h = alt.Chart(hour_count).mark_rect().encode(
alt.X('hour:O', axis=alt.Axis(title="Time", ticks=False)),
alt.Y('primary_type:O',axis=alt.Axis(title="Crime", ticks=True)),
alt.Color('count:Q',scale=colormap3),
tooltip=['count:Q']
).properties(
width=450,
height=500,
title={
"text": ["Crimes By Hour (2015-2019)"],
"subtitle": ["in Chicago"]
}
).interactive()
heatmap_h.save('heatmap_hour.html')
heatmap_h
In this step, I get the average number of arrested perpetrators then calculate the arrest rate.
# groupby arrest then calculate the mean value
arrest = allcrime.groupby('primary_type')['arrest'].mean().reset_index()
# remove excess decimal noise by rounding and then multiply each value by 100 to get a percentage
arrest['arrest_rate'] = arrest['arrest'].round(4)*100
arrest = arrest.sort_values('arrest_rate', ascending=False)
arrest.tail(n=15)
| primary_type | arrest | arrest_rate | |
|---|---|---|---|
| 23 | OFFENSE INVOLVING CHILDREN | 0.151739 | 15.17 |
| 12 | HUMAN TRAFFICKING | 0.142857 | 14.29 |
| 31 | STALKING | 0.136364 | 13.64 |
| 7 | CRIMINAL SEXUAL ASSAULT | 0.125224 | 12.52 |
| 14 | INTIMIDATION | 0.108516 | 10.85 |
| 0 | ARSON | 0.104429 | 10.44 |
| 32 | THEFT | 0.103988 | 10.40 |
| 5 | CRIM SEXUAL ASSAULT | 0.091111 | 9.11 |
| 29 | ROBBERY | 0.084881 | 8.49 |
| 17 | MOTOR VEHICLE THEFT | 0.079040 | 7.90 |
| 15 | KIDNAPPING | 0.069793 | 6.98 |
| 6 | CRIMINAL DAMAGE | 0.060331 | 6.03 |
| 20 | NON-CRIMINAL | 0.059701 | 5.97 |
| 9 | DECEPTIVE PRACTICE | 0.053185 | 5.32 |
| 3 | BURGLARY | 0.052050 | 5.20 |
# Create the histogram
fig = px.bar(arrest,
x="arrest_rate",
y="primary_type",
color="arrest_rate",
color_continuous_scale='thermal',
template="plotly_dark",
title="Crimes Arrest Rate (2015 - 2019)")
fig.update_layout(
xaxis=dict(
title='Rate(%)',
tickfont_size=12
),
yaxis=dict(
title='Crimes Type',
titlefont_size=12,
tickfont_size=6,
)
)
# Show the plot
fig.write_html("chart_crime_arrest_rate.html")
fig.show()
# groupby domestic violence then calculate the mean value
domestic = allcrime.groupby('primary_type')['domestic'].mean().reset_index()
domestic = domestic.sort_values('domestic', ascending=False)
# get the percent of domestic violence
domestic['domestic_rate'] = domestic['domestic'].round(4)*100
domestic.head()
| primary_type | domestic | domestic_rate | |
|---|---|---|---|
| 21 | NON-CRIMINAL (SUBJECT SPECIFIED) | 0.833333 | 83.33 |
| 23 | OFFENSE INVOLVING CHILDREN | 0.537047 | 53.70 |
| 2 | BATTERY | 0.487777 | 48.78 |
| 31 | STALKING | 0.456710 | 45.67 |
| 25 | OTHER OFFENSE | 0.318453 | 31.85 |
# Create the histogram
fig = px.bar(domestic[0:20],
x="domestic_rate",
y="primary_type",
color="domestic_rate",
color_continuous_scale='thermal',
template="plotly_dark",
title="Top 20 Crimes Domestic Violence Rate (2015 - 2019)")
fig.update_layout(
xaxis=dict(
title='Rate(%)',
tickfont_size=12
),
yaxis=dict(
title='Crimes Type',
titlefont_size=12,
tickfont_size=6,
)
)
# Show the plot
fig.write_html("chart_domestic_violence_rate.html")
fig.show()
import geoplot as gplt
crimes = ['ROBBERY','THEFT','BURGLARY','WEAPONS VIOLATION','NARCOTICS','ASSAULT','HOMICIDE','SEX OFFENSE','KIDNAPPING']
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(12,12))
for i , crimes in enumerate(np.random.choice(crimes, size=9, replace=False)):
data = allcrime.loc[allcrime['primary_type'] == crimes ]
ax = fig.add_subplot(3, 3, i+1)
gplt.polyplot(boundary,
ax=ax,
edgecolor="grey",
linewidth=0.2,
facecolor="white")
gplt.kdeplot(data,
shade=True,
shade_lowest=False,
cmap=cmap,
alpha=0.5,
ax=ax)
ax.set_title(crimes)
plt.suptitle('Density of Different Crimes in Chicago')
plt.figure(facecolor='black')
plt.savefig('density_of_different_crimes',bbox_inches='tight')
plt.show()
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning: `shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.
<Figure size 640x480 with 0 Axes>
# group by community
cluster1=allcrime.groupby(["community"])["count"].sum()
# add arrest rate
cluster2=allcrime.groupby(['community'])['arrest'].mean()
# add arrest domestic violence rate
cluster3=allcrime.groupby(['community'])['domestic'].mean()
cluster4 = allcrime.groupby(["community","primary_type"])["count"].sum().reset_index()
cluster4 = pd.DataFrame(cluster4).rename(columns={'count': 'crime_count'})
cluster4 =pd.pivot(
cluster4,
index=['community'],
columns = 'primary_type',
values = 'crime_count'
)
# transform dataframe from long to wide, add month column
cluster5 = allcrime.groupby(["community","month"])["count"].sum().reset_index()
cluster5 = pd.DataFrame(cluster5).rename(columns={'count': 'month_count'})
cluster5 =pd.pivot(
cluster5,
index=['community'],
columns = 'month',
values = 'month_count'
)
cluster6 = allcrime.groupby(["community","dow"])["count"].sum().reset_index()
cluster6 = pd.DataFrame(cluster6).rename(columns={'count': 'week_count'})
cluster6 =pd.pivot(
cluster6,
index=['community'],
columns = 'dow',
values = 'week_count'
)
cluster_select=allcrime.groupby(["description","community"])['count'].sum().reset_index()
cluster_select=pd.DataFrame(cluster_select)
knife=cluster_select[cluster_select["description"].str.contains('KNIFE')]
gun=cluster_select[cluster_select["description"].str.contains('GUN')]
weapon=cluster_select[cluster_select["description"].str.contains('WEAPON')]
cluster7=pd.concat([knife, gun, weapon]).rename(columns={'count': 'weapon_count'})
cluster7=cluster7.groupby(["community"])['weapon_count'].sum().reset_index()
# merge together and get final cluster
cluster=pd.merge(cluster1,cluster2,on="community")
cluster=pd.merge(cluster,cluster3,on="community")
cluster=pd.merge(cluster,cluster4,on="community")
cluster=pd.merge(cluster,cluster5,on="community")
cluster=pd.merge(cluster,cluster6,on="community")
cluster=pd.merge(cluster,cluster7,on="community")
cluster=cluster.dropna()
cluster.head()
| community | count | arrest | domestic | ARSON | ASSAULT | BATTERY | BURGLARY | CONCEALED CARRY LICENSE VIOLATION | CRIM SEXUAL ASSAULT | CRIMINAL DAMAGE | CRIMINAL SEXUAL ASSAULT | CRIMINAL TRESPASS | DECEPTIVE PRACTICE | GAMBLING | HOMICIDE | HUMAN TRAFFICKING | INTERFERENCE WITH PUBLIC OFFICER | INTIMIDATION | KIDNAPPING | LIQUOR LAW VIOLATION | MOTOR VEHICLE THEFT | NARCOTICS | NON - CRIMINAL | NON-CRIMINAL | NON-CRIMINAL (SUBJECT SPECIFIED) | OBSCENITY | OFFENSE INVOLVING CHILDREN | OTHER NARCOTIC VIOLATION | OTHER OFFENSE | PROSTITUTION | PUBLIC INDECENCY | PUBLIC PEACE VIOLATION | ROBBERY | SEX OFFENSE | STALKING | THEFT | WEAPONS VIOLATION | 1 | 10 | 11 | 12 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | weapon_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALBANY PARK | 11908 | 0.130752 | 0.141250 | 18 | 757 | 2236 | 703 | 1 | 59 | 1640 | 11 | 227 | 620 | 2 | 14 | 1 | 15 | 9 | 8 | 11 | 698 | 280 | 0 | 2 | 0 | 2 | 104 | 0 | 700 | 11 | 1 | 56 | 587 | 80 | 6 | 2909 | 140 | 934 | 1062 | 922 | 979 | 849 | 985 | 977 | 1067 | 1047 | 1002 | 1052 | 1032 | 1634 | 1756 | 1673 | 1708 | 1704 | 1735 | 1698 | 1116 |
| 1 | ARCHER HEIGHTS | 4310 | 0.166357 | 0.137355 | 8 | 293 | 709 | 338 | 1 | 16 | 595 | 6 | 79 | 259 | 0 | 9 | 0 | 5 | 6 | 10 | 4 | 256 | 128 | 0 | 1 | 0 | 4 | 38 | 0 | 225 | 12 | 1 | 37 | 137 | 25 | 4 | 1048 | 56 | 350 | 365 | 370 | 336 | 371 | 351 | 328 | 400 | 359 | 337 | 375 | 368 | 663 | 627 | 610 | 524 | 578 | 669 | 639 | 299 |
| 2 | ARMOUR SQUARE | 5081 | 0.161582 | 0.101752 | 2 | 344 | 939 | 213 | 0 | 16 | 507 | 3 | 180 | 422 | 2 | 5 | 0 | 11 | 5 | 1 | 16 | 214 | 100 | 0 | 1 | 0 | 0 | 21 | 0 | 176 | 1 | 0 | 16 | 434 | 21 | 3 | 1361 | 67 | 392 | 422 | 429 | 399 | 307 | 362 | 369 | 457 | 477 | 509 | 492 | 466 | 783 | 707 | 757 | 767 | 681 | 682 | 704 | 653 |
| 3 | ASHBURN | 11614 | 0.126830 | 0.171948 | 21 | 905 | 2055 | 784 | 0 | 42 | 1669 | 9 | 246 | 791 | 10 | 30 | 0 | 20 | 12 | 15 | 5 | 786 | 254 | 3 | 0 | 0 | 12 | 122 | 1 | 882 | 0 | 0 | 71 | 464 | 40 | 12 | 2219 | 134 | 982 | 1011 | 1006 | 1013 | 786 | 893 | 911 | 972 | 997 | 995 | 1071 | 977 | 1775 | 1696 | 1523 | 1548 | 1710 | 1666 | 1696 | 970 |
| 4 | AUBURN GRESHAM | 37599 | 0.245485 | 0.240060 | 71 | 3454 | 8470 | 2016 | 22 | 190 | 4402 | 22 | 1016 | 1698 | 28 | 111 | 1 | 285 | 9 | 32 | 23 | 1328 | 2101 | 0 | 1 | 0 | 2 | 396 | 0 | 2883 | 61 | 1 | 275 | 1668 | 90 | 28 | 5727 | 1188 | 2888 | 3372 | 2916 | 2851 | 2627 | 2968 | 3085 | 3440 | 3247 | 3521 | 3438 | 3246 | 5463 | 5383 | 5309 | 5295 | 5382 | 5389 | 5378 | 4965 |
kmeans=KMeans(n_clusters=5)
scaled=scaler.fit_transform(cluster[
['count',
'arrest',
'domestic',
'count',
'arrest',
'domestic',
'weapon_count',
'Friday',
'Monday',
'Saturday',
'Sunday',
'Thursday',
'Tuesday',
'Wednesday']
])
scaled.mean(axis=0)
scaled.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
kmeans.fit(scaled)
cluster['label']=kmeans.labels_
kmeans.inertia_
244.55318408291294
cluster_geo=boundary.merge(cluster,on="community")
cluster_geo=cluster_geo.drop(columns=['area', 'shape_area','perimeter','area_num_1','comarea_id','comarea','shape_len']).rename(columns={'area_numbe': 'community_area'})
cluster_geo=cluster_geo.to_crs(3857)
cluster_geo['label']=cluster_geo['label'].fillna(-1)
cluster_geo.head()
| community | community_area | geometry | count | arrest | domestic | ARSON | ASSAULT | BATTERY | BURGLARY | CONCEALED CARRY LICENSE VIOLATION | CRIM SEXUAL ASSAULT | CRIMINAL DAMAGE | CRIMINAL SEXUAL ASSAULT | CRIMINAL TRESPASS | DECEPTIVE PRACTICE | GAMBLING | HOMICIDE | HUMAN TRAFFICKING | INTERFERENCE WITH PUBLIC OFFICER | INTIMIDATION | KIDNAPPING | LIQUOR LAW VIOLATION | MOTOR VEHICLE THEFT | NARCOTICS | NON - CRIMINAL | NON-CRIMINAL | NON-CRIMINAL (SUBJECT SPECIFIED) | OBSCENITY | OFFENSE INVOLVING CHILDREN | OTHER NARCOTIC VIOLATION | OTHER OFFENSE | PROSTITUTION | PUBLIC INDECENCY | PUBLIC PEACE VIOLATION | ROBBERY | SEX OFFENSE | STALKING | THEFT | WEAPONS VIOLATION | 1 | 10 | 11 | 12 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | weapon_count | label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DOUGLAS | 35 | MULTIPOLYGON (((-9752604.951 5137743.450, -975... | 12994 | 0.192012 | 0.145375 | 5 | 1095 | 2658 | 262 | 3 | 64 | 1202 | 12 | 516 | 887 | 3 | 20 | 1 | 27 | 11 | 9 | 3 | 477 | 353 | 1 | 2 | 0 | 3 | 80 | 0 | 925 | 1 | 1 | 69 | 621 | 55 | 11 | 3481 | 136 | 1051 | 1242 | 1063 | 965 | 900 | 1057 | 1023 | 1100 | 1144 | 1151 | 1120 | 1178 | 1896 | 1981 | 1675 | 1626 | 1880 | 2008 | 1928 | 1108 | 0 |
| 1 | OAKLAND | 36 | MULTIPOLYGON (((-9750713.852 5133595.674, -975... | 3228 | 0.091388 | 0.243185 | 5 | 309 | 612 | 101 | 1 | 18 | 455 | 3 | 84 | 178 | 0 | 9 | 0 | 3 | 1 | 3 | 0 | 155 | 37 | 0 | 1 | 0 | 3 | 33 | 0 | 302 | 0 | 0 | 7 | 119 | 14 | 3 | 720 | 52 | 249 | 294 | 223 | 237 | 188 | 245 | 261 | 290 | 321 | 309 | 348 | 263 | 457 | 466 | 491 | 468 | 437 | 461 | 448 | 314 | 4 |
| 2 | FULLER PARK | 37 | MULTIPOLYGON (((-9754793.199 5131350.020, -975... | 4362 | 0.249427 | 0.149473 | 7 | 377 | 835 | 113 | 1 | 23 | 440 | 6 | 188 | 246 | 1 | 16 | 1 | 49 | 5 | 3 | 0 | 198 | 244 | 1 | 0 | 1 | 2 | 44 | 0 | 220 | 0 | 0 | 22 | 248 | 15 | 3 | 938 | 115 | 309 | 381 | 327 | 299 | 283 | 327 | 368 | 385 | 451 | 428 | 376 | 428 | 647 | 618 | 646 | 634 | 598 | 633 | 586 | 613 | 4 |
| 3 | GRAND BOULEVARD | 38 | MULTIPOLYGON (((-9752334.139 5133578.410, -975... | 16251 | 0.179927 | 0.207987 | 10 | 1329 | 3616 | 645 | 1 | 67 | 1909 | 13 | 322 | 856 | 4 | 41 | 2 | 72 | 5 | 10 | 13 | 772 | 671 | 0 | 0 | 0 | 9 | 132 | 0 | 1162 | 63 | 1 | 69 | 766 | 57 | 14 | 3356 | 264 | 1137 | 1402 | 1336 | 1239 | 1079 | 1270 | 1281 | 1470 | 1497 | 1537 | 1574 | 1429 | 2378 | 2339 | 2282 | 2192 | 2374 | 2335 | 2351 | 1765 | 4 |
| 4 | KENWOOD | 39 | MULTIPOLYGON (((-9750713.852 5133595.674, -975... | 7018 | 0.105443 | 0.188800 | 7 | 565 | 1281 | 289 | 0 | 41 | 830 | 10 | 184 | 528 | 0 | 14 | 0 | 6 | 7 | 3 | 0 | 323 | 103 | 0 | 1 | 0 | 4 | 44 | 0 | 523 | 0 | 0 | 20 | 367 | 33 | 4 | 1770 | 61 | 596 | 636 | 583 | 585 | 459 | 536 | 515 | 640 | 619 | 669 | 614 | 566 | 1048 | 1070 | 958 | 916 | 1018 | 1030 | 978 | 659 | 4 |
cluster_geo.groupby("label").size()
label 0 31 1 17 2 5 3 1 4 23 dtype: int64
cluster_geo.groupby("label")['count'].mean()
label 0 10043.032258 1 30559.000000 2 42966.600000 3 77134.000000 4 7862.782609 Name: count, dtype: float64
import h3
import libpysal
from cenpy import products
from tobler.util import h3fy
from tobler.area_weighted import area_interpolate
import contextily as ctx
from cenpy import products
cluster_geo_hex = h3fy(cluster_geo, resolution=8)
cluster_geo_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['label'])
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
cluster_geo['label'] = cluster_geo['label'].astype('category')
fig, axs = plt.subplots(1,2, figsize=(18,10))
cluster_geo.plot('label', cmap='magma', alpha=0.6, linewidth=0.6, edgecolor='white', ax=axs[0],legend=True)
cluster_geo_hex.plot('label', cmap='magma', alpha=0.6, linewidth=0.6, edgecolor='white', ax=axs[1], legend=True, legend_kwds=dict(loc='upper right'),scheme="Quantiles", k=5)
axs[0].set_title('Community Clusters')
axs[1].set_title('Hex Clusters')
for ax in axs:
ax.axis('off')
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.Toner)
plt.suptitle('Spatial Cluster Analysis of Crime in Chicago', fontsize=20)
plt.savefig('spatial_cluster_analysis',bbox_inches='tight')
crime_3857 = crime_geo.to_crs(3857)
cluster_geo = cluster_geo.to_crs(3857)
def get_xy_from_geometry(df):
"""
Return a numpy array with two columns, where the
first holds the `x` geometry coordinate and the second
column holds the `y` geometry coordinate
Note: this works with both Point() and Polygon() objects.
"""
# NEW: use the centroid.x and centroid.y to support Polygon() and Point() geometries
x = df.geometry.centroid.x
y = df.geometry.centroid.y
return np.column_stack((x, y)) # stack as columns
# Extract x/y for sales
crimeXY = get_xy_from_geometry(crime_3857)
crime_polygon = get_xy_from_geometry(cluster_geo)
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/g5wi-brhg.geojson"
deposit = gpd.read_file(url)
deposit = deposit[deposit.geometry.notnull()]
# Get the X/Y
depositXY = get_xy_from_geometry(deposit.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(depositXY)
depositDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistDeposit"] = np.log10(depositDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
depositXY = get_xy_from_geometry(deposit.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(depositXY)
depositDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistDeposit"] = np.log10(depositDists.mean(axis=1))
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/kc9i-wq85.geojson"
abbuilding = gpd.read_file(url)
abbuilding = abbuilding[abbuilding.geometry.notnull()]
# Get the X/Y
abbuildingXY = get_xy_from_geometry(abbuilding.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(abbuildingXY)
abbuildingDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistAbbuilding"] = np.log10(abbuildingDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
abbuildingXY = get_xy_from_geometry(abbuilding.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(abbuildingXY)
abbuildingDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistAbbuilding"] = np.log10(abbuildingDists.mean(axis=1))
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/tz49-n8ze.geojson"
school = gpd.read_file(url)
school = school[school.geometry.notnull()]
# Get the X/Y
schoolXY = get_xy_from_geometry(school.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(schoolXY)
schoolDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistSchool"] = np.log10(schoolDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
schoolXY = get_xy_from_geometry(school.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(schoolXY)
schoolDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistSchool"] = np.log10(schoolDists.mean(axis=1))
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/53t8-wyrc.geojson"
grocery = gpd.read_file(url)
grocery = grocery[grocery.geometry.notnull()]
# Get the X/Y
groceryXY = get_xy_from_geometry(grocery.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(groceryXY)
groceryDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistGrocery"] = np.log10(groceryDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
groceryXY = get_xy_from_geometry(grocery.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(groceryXY)
groceryDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistGrocery"] = np.log10(groceryDists.mean(axis=1))
chicago_outline = boundary.geometry.unary_union
chicago_outline
# Get the subway stops within the city limits
subway = ox.geometries_from_polygon(chicago_outline, tags={"station": "subway"})
# Convert to 3857 (meters)
subway = subway.to_crs(epsg=3857)
#========crime point=======
# STEP 1: x/y coordinates of subway stops (in EPGS=3857)
subwayXY = get_xy_from_geometry(subway.to_crs(epsg=3857))
# STEP 2: Initialize the algorithm, use 𝑘=5 to get the distance to the nearest stop.
nbrs = NearestNeighbors(n_neighbors=5)
# STEP 3: Fit the algorithm on the "neighbors" dataset
nbrs.fit(subwayXY)
# STEP 4: Get distances for sale to neighbors
subwayDists, subwayIndices = nbrs.kneighbors(crimeXY)
# STEP 5: add back to the original dataset
crime_3857["logDistSubway"] = np.log10(subwayDists.mean(axis=1))
#========community polygon=======
# STEP 1: x/y coordinates of subway stops (in EPGS=3857)
subwayXY = get_xy_from_geometry(subway.to_crs(epsg=3857))
# STEP 2: Initialize the algorithm, use 𝑘=5 to get the distance to the nearest stop.
nbrs = NearestNeighbors(n_neighbors=5)
# STEP 3: Fit the algorithm on the "neighbors" dataset
nbrs.fit(subwayXY)
# STEP 4: Get distances for sale to neighbors
subwayDists, subwayIndices = nbrs.kneighbors(crime_polygon)
# STEP 5: add back to the original dataset
cluster_geo["logDistSubway"] = np.log10(subwayDists.mean(axis=1))
#========crime point=======
# Get the subway stops within the city limits
retail = ox.geometries_from_polygon(chicago_outline, tags = {'landuse':['retail','commercial']})
# Convert to 3857 (meters)
retail = retail.to_crs(epsg=3857)
# Get the X/Y
retailXY = get_xy_from_geometry(retail.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(retailXY)
retailDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistRetail"] = np.log10(retailDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
retailXY = get_xy_from_geometry(retail.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(retailXY)
retailDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistRetail"] = np.log10(retailDists.mean(axis=1))
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/tdab-kixi.geojson"
landmark = gpd.read_file(url)
#landmark = landmark[landmark.geometry.notnull()]
# Get the X/Y
landmarkXY = get_xy_from_geometry(landmark.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(landmarkXY)
landmarkDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistLandmark"] = np.log10(landmarkDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
landmarkXY = get_xy_from_geometry(landmark.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(landmarkXY)
landmarkDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistLandmark"] = np.log10(landmarkDists.mean(axis=1))
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/3h7q-7mdb.geojson"
alerts = gpd.read_file(url)
alerts = alerts[alerts.geometry.notnull()]
# Get the X/Y
alertsXY = get_xy_from_geometry(alerts.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(alertsXY)
alertsDists, _ = nbrs.kneighbors(crimeXY)
# Add the new feature
crime_3857["logDistAlerts"] = np.log10(alertsDists.mean(axis=1))
#========community polygon=======
# Get the X/Y
alertsXY = get_xy_from_geometry(alerts.to_crs(epsg=3857))
# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(alertsXY)
alertsDists, _ = nbrs.kneighbors(crime_polygon)
# Add the new feature
cluster_geo["logDistAlerts"] = np.log10(alertsDists.mean(axis=1))
cluster_geo_hex = h3fy(cluster_geo, resolution=8)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
# Plot all features in the same figure
fig, axs = plt.subplots(ncols=2,nrows=4, figsize=(16,28))
# logDistDeposit
ax=axs[0][0]
logDistDeposit_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistDeposit'])
logDistDeposit_hex = logDistDeposit_hex.to_crs(3857)
logDistDeposit_hex.plot('logDistDeposit', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Despository Locations',fontsize=12)
# logDistAbbuilding
ax=axs[0][1]
logDistAbbuilding_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistAbbuilding'])
logDistAbbuilding_hex = logDistAbbuilding_hex.to_crs(3857)
logDistAbbuilding_hex.plot('logDistAbbuilding', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Abandon Buildings',fontsize=12)
# logDistSchool
ax=axs[1][0]
logDistSchool_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistSchool'])
logDistSchool_hex = logDistSchool_hex.to_crs(3857)
logDistSchool_hex.plot('logDistSchool', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Schools',fontsize=12)
# logDistGrocery
ax=axs[1][1]
logDistGrocery_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistGrocery'])
logDistGrocery_hex = logDistGrocery_hex.to_crs(3857)
logDistGrocery_hex.plot('logDistGrocery', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Grocery Stores',fontsize=12)
# logDistSubway
ax=axs[2][0]
logDistSubway_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistSubway'])
logDistSubway_hex = logDistSubway_hex.to_crs(3857)
logDistSubway_hex.plot('logDistSubway', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Subway Stations',fontsize=12)
# logDistRetail
ax=axs[2][1]
logDistRetail_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistRetail'])
logDistRetail_hex = logDistRetail_hex.to_crs(3857)
logDistRetail_hex.plot('logDistRetail', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Retails and Commercials',fontsize=12)
# logDistLandmark
ax=axs[3][0]
logDistLandmark_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistLandmark'])
logDistLandmark_hex = logDistLandmark_hex.to_crs(3857)
logDistLandmark_hex.plot('logDistLandmark', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Landmarks',fontsize=12)
# logDistAlerts
ax=axs[3][1]
logDistAlerts_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistAlerts'])
logDistAlerts_hex = logDistAlerts_hex.to_crs(3857)
logDistAlerts_hex.plot('logDistAlerts', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Shotspotter Alerts',fontsize=12)
plt.savefig('hexgrids_nn',bbox_inches='tight')
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
cluster_geo['community_area'] = cluster_geo['community_area'].astype("Int64")
cluster_geo = pd.merge(cluster_geo, census, on = "community_area")
# The overall crime rate of a county is calculated by dividing the total number of Index crimes submitted by police agencies in each county by the county’s population
# Then multiply the result by 100,000.
# The U.S. Census Bureau is the source of county population data.
cluster_geo['count'] = cluster_geo['count'].astype("Int64")
cluster_geo['crime_index'] = cluster_geo['count'] / 2740000 * 100000
cluster_geo['people_risk_index'] = cluster_geo['count'] / (cluster_geo['percent_aged_under_18_or_over_64'] * 2740000) * 100000
cluster_geo['weapon_rate'] = cluster_geo['weapon_count']/ cluster_geo['count']
cluster_geo['weapon_rate'] = cluster_geo['weapon_rate'].apply(pd.to_numeric)
cluster_geo['crime_index'] = cluster_geo['crime_index'].apply(pd.to_numeric)
cluster_geo['people_risk_index'] = cluster_geo['people_risk_index'].apply(pd.to_numeric)
cluster_geo.head()
| community | community_area | geometry | count | arrest | domestic | ARSON | ASSAULT | BATTERY | BURGLARY | CONCEALED CARRY LICENSE VIOLATION | CRIM SEXUAL ASSAULT | CRIMINAL DAMAGE | CRIMINAL SEXUAL ASSAULT | CRIMINAL TRESPASS | DECEPTIVE PRACTICE | GAMBLING | HOMICIDE | HUMAN TRAFFICKING | INTERFERENCE WITH PUBLIC OFFICER | INTIMIDATION | KIDNAPPING | LIQUOR LAW VIOLATION | MOTOR VEHICLE THEFT | NARCOTICS | NON - CRIMINAL | NON-CRIMINAL | NON-CRIMINAL (SUBJECT SPECIFIED) | OBSCENITY | OFFENSE INVOLVING CHILDREN | OTHER NARCOTIC VIOLATION | OTHER OFFENSE | PROSTITUTION | PUBLIC INDECENCY | PUBLIC PEACE VIOLATION | ROBBERY | SEX OFFENSE | STALKING | THEFT | WEAPONS VIOLATION | 1 | 10 | 11 | 12 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | weapon_count | label | logDistDeposit | logDistAbbuilding | logDistSchool | logDistGrocery | logDistSubway | logDistRetail | logDistLandmark | logDistAlerts | community_area_name | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index | crime_index | people_risk_index | weapon_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DOUGLAS | 35 | MULTIPOLYGON (((-9752604.951 5137743.450, -975... | 12994 | 0.192012 | 0.145375 | 5 | 1095 | 2658 | 262 | 3 | 64 | 1202 | 12 | 516 | 887 | 3 | 20 | 1 | 27 | 11 | 9 | 3 | 477 | 353 | 1 | 2 | 0 | 3 | 80 | 0 | 925 | 1 | 1 | 69 | 621 | 55 | 11 | 3481 | 136 | 1051 | 1242 | 1063 | 965 | 900 | 1057 | 1023 | 1100 | 1144 | 1151 | 1120 | 1178 | 1896 | 1981 | 1675 | 1626 | 1880 | 2008 | 1928 | 1108 | 0 | 3.103828 | 3.518327 | 2.745583 | 2.960215 | 3.293262 | 2.758428 | 2.391839 | 3.068095 | Douglas | 1.8 | 29.6 | 18.2 | 14.3 | 30.7 | 23791 | 47.0 | 474.233577 | 15.447348 | 0.085270 |
| 1 | OAKLAND | 36 | MULTIPOLYGON (((-9750713.852 5133595.674, -975... | 3228 | 0.091388 | 0.243185 | 5 | 309 | 612 | 101 | 1 | 18 | 455 | 3 | 84 | 178 | 0 | 9 | 0 | 3 | 1 | 3 | 0 | 155 | 37 | 0 | 1 | 0 | 3 | 33 | 0 | 302 | 0 | 0 | 7 | 119 | 14 | 3 | 720 | 52 | 249 | 294 | 223 | 237 | 188 | 245 | 261 | 290 | 321 | 309 | 348 | 263 | 457 | 466 | 491 | 468 | 437 | 461 | 448 | 314 | 4 | 3.310177 | 3.500816 | 3.025448 | 3.136585 | 3.413476 | 2.857741 | 2.959350 | 2.940911 | Oakland | 1.3 | 39.7 | 28.7 | 18.4 | 40.4 | 19252 | 78.0 | 117.810219 | 2.916095 | 0.097274 |
| 2 | FULLER PARK | 37 | MULTIPOLYGON (((-9754793.199 5131350.020, -975... | 4362 | 0.249427 | 0.149473 | 7 | 377 | 835 | 113 | 1 | 23 | 440 | 6 | 188 | 246 | 1 | 16 | 1 | 49 | 5 | 3 | 0 | 198 | 244 | 1 | 0 | 1 | 2 | 44 | 0 | 220 | 0 | 0 | 22 | 248 | 15 | 3 | 938 | 115 | 309 | 381 | 327 | 299 | 283 | 327 | 368 | 385 | 451 | 428 | 376 | 428 | 647 | 618 | 646 | 634 | 598 | 633 | 586 | 613 | 4 | 3.478687 | 2.817223 | 2.926504 | 3.185645 | 3.185514 | 2.244567 | 3.102937 | 2.718943 | Fuller Park | 3.2 | 51.2 | 33.9 | 26.6 | 44.9 | 10432 | 97.0 | 159.197080 | 3.545592 | 0.140532 |
| 3 | GRAND BOULEVARD | 38 | MULTIPOLYGON (((-9752334.139 5133578.410, -975... | 16251 | 0.179927 | 0.207987 | 10 | 1329 | 3616 | 645 | 1 | 67 | 1909 | 13 | 322 | 856 | 4 | 41 | 2 | 72 | 5 | 10 | 13 | 772 | 671 | 0 | 0 | 0 | 9 | 132 | 0 | 1162 | 63 | 1 | 69 | 766 | 57 | 14 | 3356 | 264 | 1137 | 1402 | 1336 | 1239 | 1079 | 1270 | 1281 | 1470 | 1497 | 1537 | 1574 | 1429 | 2378 | 2339 | 2282 | 2192 | 2374 | 2335 | 2351 | 1765 | 4 | 3.241109 | 3.121504 | 2.892320 | 2.805637 | 3.054997 | 2.562950 | 2.731433 | 2.856737 | Grand Boulevard | 3.3 | 29.3 | 24.3 | 15.9 | 39.5 | 23472 | 57.0 | 593.102190 | 15.015245 | 0.108609 |
| 4 | KENWOOD | 39 | MULTIPOLYGON (((-9750713.852 5133595.674, -975... | 7018 | 0.105443 | 0.188800 | 7 | 565 | 1281 | 289 | 0 | 41 | 830 | 10 | 184 | 528 | 0 | 14 | 0 | 6 | 7 | 3 | 0 | 323 | 103 | 0 | 1 | 0 | 4 | 44 | 0 | 523 | 0 | 0 | 20 | 367 | 33 | 4 | 1770 | 61 | 596 | 636 | 583 | 585 | 459 | 536 | 515 | 640 | 619 | 669 | 614 | 566 | 1048 | 1070 | 958 | 916 | 1018 | 1030 | 978 | 659 | 4 | 2.637666 | 3.548489 | 2.595956 | 2.929183 | 3.466788 | 2.487147 | 2.820633 | 2.996621 | Kenwood | 2.4 | 21.7 | 15.7 | 11.3 | 35.4 | 35911 | 26.0 | 256.131387 | 7.235350 | 0.093901 |
cluster_geo_hex = h3fy(cluster_geo, resolution=8)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
# Plot all features in the same figure
fig, axs = plt.subplots(ncols=2,nrows=3, figsize=(12,18))
# percent_of_housing_crowded
ax=axs[0][0]
percent_of_housing_crowded_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['crime_index'])
percent_of_housing_crowded_hex = percent_of_housing_crowded_hex.to_crs(3857)
percent_of_housing_crowded_hex.plot('crime_index', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Crime Rate',fontsize=12)
# percent_households_below_poverty
ax=axs[0][1]
percent_households_below_poverty_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['people_risk_index'])
percent_households_below_poverty_hex = percent_households_below_poverty_hex.to_crs(3857)
percent_households_below_poverty_hex.plot('people_risk_index', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Risk Rate of People (under 18 or over 64)',fontsize=12)
# per_capita_income_
ax=axs[1][0]
percent_of_housing_crowded_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['percent_of_housing_crowded'])
percent_of_housing_crowded_hex = percent_of_housing_crowded_hex.to_crs(3857)
percent_of_housing_crowded_hex.plot('percent_of_housing_crowded', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Percent of Housing Crowded',fontsize=12)
# hardship_index
ax=axs[1][1]
hardship_index_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['hardship_index'])
hardship_index_hex = hardship_index_hex.to_crs(3857)
hardship_index_hex.plot('hardship_index', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Hardship Index',fontsize=12)
ax=axs[2][0]
count_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['domestic'])
count_hex = count_hex.to_crs(3857)
count_hex.plot('domestic', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Domestic Rate',fontsize=12)
ax=axs[2][1]
weapon_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['weapon_rate'])
weapon_hex = weapon_hex.to_crs(3857)
weapon_hex.plot('weapon_rate', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Weapon Rate',fontsize=12)
plt.savefig('hexgrids_features',bbox_inches='tight')
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
crime_3857['count'] = crime_3857['count'].astype("Int64")
crime_3857['crime_index'] = crime_3857['count'] / 2740000 * 100000
crime_3857['people_risk_index'] = crime_3857['count'] / (cluster_geo['percent_aged_under_18_or_over_64'] * 2740000) * 100000
crime_3857['weapon_rate'] = crime_3857['weapon_count']/ cluster_geo['count']
crime_3857['weapon_rate'] = crime_3857['weapon_rate'].apply(pd.to_numeric).replace(np.nan, 0)
crime_3857['crime_index'] = crime_3857['crime_index'].apply(pd.to_numeric).replace(np.nan, 0)
crime_3857['people_risk_index'] = crime_3857['people_risk_index'].apply(pd.to_numeric).replace(np.nan, 0)
#select the columns I want to use
feature_cols = [
'community',
'month_count',
'month',
'dow',
'week_count',
'weapon_count',
'count',
'crime_index',
'weapon_rate',
'people_risk_index',
'domestic',
'arrest',
'percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
"logDistDeposit",
"logDistAbbuilding",
"logDistSchool",
"logDistGrocery",
"logDistSubway",
"logDistRetail",
"logDistAlerts",
"logDistLandmark"]
crime = crime_3857[feature_cols]
crime
| community | month_count | month | dow | week_count | weapon_count | count | crime_index | weapon_rate | people_risk_index | domestic | arrest | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index | logDistDeposit | logDistAbbuilding | logDistSchool | logDistGrocery | logDistSubway | logDistRetail | logDistAlerts | logDistLandmark | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ROGERS PARK | 1378 | 1 | Friday | 2583 | 1309 | 18144 | 662.189781 | 0.100739 | 21.569700 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 |
| 1 | ROGERS PARK | 1378 | 1 | Monday | 2744 | 1309 | 18144 | 662.189781 | 0.405514 | 16.390836 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 |
| 2 | ROGERS PARK | 1378 | 1 | Saturday | 2580 | 1309 | 18144 | 662.189781 | 0.300092 | 14.748102 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 |
| 3 | ROGERS PARK | 1378 | 1 | Sunday | 2549 | 1309 | 18144 | 662.189781 | 0.080549 | 16.764298 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 |
| 4 | ROGERS PARK | 1378 | 1 | Thursday | 2522 | 1309 | 18144 | 662.189781 | 0.186520 | 18.705926 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6463 | EDGEWATER | 1081 | 9 | Saturday | 1739 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 |
| 6464 | EDGEWATER | 1081 | 9 | Sunday | 1591 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 |
| 6465 | EDGEWATER | 1081 | 9 | Thursday | 1738 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 |
| 6466 | EDGEWATER | 1081 | 9 | Tuesday | 1756 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 |
| 6467 | EDGEWATER | 1081 | 9 | Wednesday | 1786 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 |
6468 rows × 27 columns
plt.figure(figsize=(10,8))
sns.set(font_scale=0.6)
sns.heatmap(crime.corr(), cmap="magma", annot=True, vmin=-1, vmax=1, alpha=0.8, linewidths=0.5, annot_kws={'size': 6})
plt.savefig('correlation_matrix')
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/rcmod.py:400: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
# Numerical columns
num_cols = [
'percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
"logDistDeposit",
"logDistAbbuilding",
"logDistSchool",
"logDistGrocery",
"logDistSubway",
"logDistRetail",
"logDistLandmark",
"logDistAlerts"
]
# Categorical columns
cat_cols = [
'dow',
"community",
]
# Split the data
train_set, test_set = train_test_split(crime, test_size=0.3, random_state=42)
# the target labels: log of count
y_train = np.log(crime['crime_index'])
y_test = np.log(crime['crime_index'])
common_samples = set(train_set.index).intersection(set(y_train.index))
train_set = train_set.loc[common_samples]
y_train = y_train.loc[common_samples]
/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/4101391604.py:1: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead. /var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/4101391604.py:2: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
common_samples2 = set(test_set.index).intersection(set(y_test.index))
test_set = test_set.loc[common_samples2]
y_test = y_test.loc[common_samples2]
/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/2220761477.py:1: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead. /var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/2220761477.py:2: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
# Set up the column transformer with two transformers
transformer = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
]
)
# Initialize the pipeline
pipe = make_pipeline(
transformer, RandomForestRegressor(random_state=42)
)
# Make the grid of parameters to search
# NOTE: you must prepend the name of the pipeline step
model_name = "randomforestregressor"
param_grid = {
f"{model_name}__n_estimators": [5, 50, 100, 150, 200],
f"{model_name}__max_depth": [2, 5, 7, 9, 13]
}
param_grid
{'randomforestregressor__n_estimators': [5, 50, 100, 150, 200],
'randomforestregressor__max_depth': [2, 5, 7, 9, 13]}
# Create the grid and use 5-fold CV
grid = GridSearchCV(pipe, param_grid, verbose=5, cv=3, error_score='raise')
# Run the search
grid.fit(train_set, y_train)
Fitting 3 folds for each of 25 candidates, totalling 75 fits [CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=5;, score=0.344 total time= 0.0s [CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=5;, score=0.125 total time= 0.0s [CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=5;, score=0.236 total time= 0.0s [CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=50;, score=0.343 total time= 0.2s [CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=50;, score=0.126 total time= 0.1s [CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=50;, score=0.331 total time= 0.1s [CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=100;, score=0.339 total time= 0.2s [CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=100;, score=0.125 total time= 0.2s [CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=100;, score=0.339 total time= 0.3s [CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=150;, score=0.336 total time= 0.4s [CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=150;, score=0.123 total time= 0.3s [CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=150;, score=0.350 total time= 0.3s [CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=200;, score=0.337 total time= 0.5s [CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=200;, score=0.122 total time= 0.5s [CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=200;, score=0.345 total time= 0.4s [CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=5;, score=0.429 total time= 0.0s [CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=5;, score=0.085 total time= 0.0s [CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=5;, score=-0.298 total time= 0.0s [CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=50;, score=0.377 total time= 0.3s [CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=50;, score=0.094 total time= 0.3s [CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=50;, score=0.145 total time= 0.3s [CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=100;, score=0.383 total time= 0.6s [CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=100;, score=0.087 total time= 0.9s [CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=100;, score=0.175 total time= 0.6s [CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=150;, score=0.373 total time= 0.8s [CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=150;, score=0.076 total time= 0.9s [CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=150;, score=0.165 total time= 0.8s [CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=200;, score=0.374 total time= 1.2s [CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=200;, score=0.070 total time= 1.4s [CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=200;, score=0.164 total time= 1.2s [CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=5;, score=0.296 total time= 0.1s [CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=5;, score=0.130 total time= 0.1s [CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=5;, score=-0.185 total time= 0.1s [CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=50;, score=0.298 total time= 0.5s [CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=50;, score=0.104 total time= 0.5s [CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=50;, score=0.173 total time= 0.5s [CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=100;, score=0.286 total time= 0.9s [CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=100;, score=0.106 total time= 0.9s [CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=100;, score=0.211 total time= 0.9s [CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=150;, score=0.278 total time= 1.4s [CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=150;, score=0.096 total time= 1.4s [CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=150;, score=0.216 total time= 1.6s [CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=200;, score=0.277 total time= 1.9s [CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=200;, score=0.085 total time= 1.8s [CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=200;, score=0.216 total time= 1.8s [CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=5;, score=0.352 total time= 0.1s [CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=5;, score=0.036 total time= 0.1s [CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=5;, score=-0.066 total time= 0.1s [CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=50;, score=0.309 total time= 0.9s [CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=50;, score=0.089 total time= 0.6s [CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=50;, score=0.204 total time= 0.6s [CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=100;, score=0.291 total time= 1.2s [CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=100;, score=0.091 total time= 1.3s [CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=100;, score=0.251 total time= 1.2s [CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=150;, score=0.276 total time= 1.8s [CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=150;, score=0.083 total time= 2.0s [CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=150;, score=0.247 total time= 2.1s [CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=200;, score=0.278 total time= 2.5s [CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=200;, score=0.078 total time= 2.6s [CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=200;, score=0.238 total time= 2.6s [CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=5;, score=0.309 total time= 0.1s [CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=5;, score=0.144 total time= 0.1s [CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=5;, score=-0.195 total time= 0.1s [CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=50;, score=0.299 total time= 0.7s [CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=50;, score=0.103 total time= 0.8s [CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=50;, score=0.205 total time= 0.7s [CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=100;, score=0.292 total time= 1.6s [CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=100;, score=0.089 total time= 1.6s [CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=100;, score=0.245 total time= 1.4s [CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=150;, score=0.282 total time= 2.1s [CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=150;, score=0.080 total time= 2.2s [CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=150;, score=0.246 total time= 2.4s [CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=200;, score=0.281 total time= 2.7s [CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=200;, score=0.073 total time= 3.0s [CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=200;, score=0.240 total time= 3.2s
GridSearchCV(cv=3, error_score='raise',
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num',
StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_ind...
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow',
'community'])])),
('randomforestregressor',
RandomForestRegressor(random_state=42))]),
param_grid={'randomforestregressor__max_depth': [2, 5, 7, 9, 13],
'randomforestregressor__n_estimators': [5, 50, 100,
150, 200]},
verbose=5)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GridSearchCV(cv=3, error_score='raise',
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num',
StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_ind...
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow',
'community'])])),
('randomforestregressor',
RandomForestRegressor(random_state=42))]),
param_grid={'randomforestregressor__max_depth': [2, 5, 7, 9, 13],
'randomforestregressor__n_estimators': [5, 50, 100,
150, 200]},
verbose=5)Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
'logDistDeposit',
'logDistAbbuilding',
'logDistSchool',
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])),
('randomforestregressor',
RandomForestRegressor(random_state=42))])ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_', 'hardship_index',
'logDistDeposit', 'logDistAbbuilding',
'logDistSchool', 'logDistGrocery',
'logDistSubway', 'logDistRetail',
'logDistLandmark', 'logDistAlerts']),
('cat', OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])['percent_of_housing_crowded', 'percent_households_below_poverty', 'percent_aged_16_unemployed', 'percent_aged_25_without_high_school_diploma', 'percent_aged_under_18_or_over_64', 'per_capita_income_', 'hardship_index', 'logDistDeposit', 'logDistAbbuilding', 'logDistSchool', 'logDistGrocery', 'logDistSubway', 'logDistRetail', 'logDistLandmark', 'logDistAlerts']
StandardScaler()
['dow', 'community']
OneHotEncoder(handle_unknown='ignore')
RandomForestRegressor(random_state=42)
# Evaluate the best random forest model
best_random = grid.best_estimator_
best_random
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
'logDistDeposit',
'logDistAbbuilding',
'logDistSchool',
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])),
('randomforestregressor',
RandomForestRegressor(max_depth=2, n_estimators=150,
random_state=42))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
'logDistDeposit',
'logDistAbbuilding',
'logDistSchool',
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])),
('randomforestregressor',
RandomForestRegressor(max_depth=2, n_estimators=150,
random_state=42))])ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_', 'hardship_index',
'logDistDeposit', 'logDistAbbuilding',
'logDistSchool', 'logDistGrocery',
'logDistSubway', 'logDistRetail',
'logDistLandmark', 'logDistAlerts']),
('cat', OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])['percent_of_housing_crowded', 'percent_households_below_poverty', 'percent_aged_16_unemployed', 'percent_aged_25_without_high_school_diploma', 'percent_aged_under_18_or_over_64', 'per_capita_income_', 'hardship_index', 'logDistDeposit', 'logDistAbbuilding', 'logDistSchool', 'logDistGrocery', 'logDistSubway', 'logDistRetail', 'logDistLandmark', 'logDistAlerts']
StandardScaler()
['dow', 'community']
OneHotEncoder(handle_unknown='ignore')
RandomForestRegressor(max_depth=2, n_estimators=150, random_state=42)
grid.best_estimator_["randomforestregressor"]
RandomForestRegressor(max_depth=2, n_estimators=150, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestRegressor(max_depth=2, n_estimators=150, random_state=42)
grid.best_params_
{'randomforestregressor__max_depth': 2,
'randomforestregressor__n_estimators': 150}
grid.score(test_set, y_test)
0.5438955252961919
pipe.fit(train_set, y_train)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
'logDistDeposit',
'logDistAbbuilding',
'logDistSchool',
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])),
('randomforestregressor',
RandomForestRegressor(random_state=42))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_',
'hardship_index',
'logDistDeposit',
'logDistAbbuilding',
'logDistSchool',
'logDistGrocery',
'logDistSubway',
'logDistRetail',
'logDistLandmark',
'logDistAlerts']),
('cat',
OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])),
('randomforestregressor',
RandomForestRegressor(random_state=42))])ColumnTransformer(transformers=[('num', StandardScaler(),
['percent_of_housing_crowded',
'percent_households_below_poverty',
'percent_aged_16_unemployed',
'percent_aged_25_without_high_school_diploma',
'percent_aged_under_18_or_over_64',
'per_capita_income_', 'hardship_index',
'logDistDeposit', 'logDistAbbuilding',
'logDistSchool', 'logDistGrocery',
'logDistSubway', 'logDistRetail',
'logDistLandmark', 'logDistAlerts']),
('cat', OneHotEncoder(handle_unknown='ignore'),
['dow', 'community'])])['percent_of_housing_crowded', 'percent_households_below_poverty', 'percent_aged_16_unemployed', 'percent_aged_25_without_high_school_diploma', 'percent_aged_under_18_or_over_64', 'per_capita_income_', 'hardship_index', 'logDistDeposit', 'logDistAbbuilding', 'logDistSchool', 'logDistGrocery', 'logDistSubway', 'logDistRetail', 'logDistLandmark', 'logDistAlerts']
StandardScaler()
['dow', 'community']
OneHotEncoder(handle_unknown='ignore')
RandomForestRegressor(random_state=42)
# Get the R2 on the test set!
best_random.score(test_set, y_test)
0.5438955252961919
ohe = transformer.named_transformers_['cat']
ohe
OneHotEncoder(handle_unknown='ignore')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
OneHotEncoder(handle_unknown='ignore')
# One column for each category type!
ohe_cols = ohe.get_feature_names_out()
ohe_cols
array(['dow_Friday', 'dow_Monday', 'dow_Saturday', 'dow_Sunday',
'dow_Thursday', 'dow_Tuesday', 'dow_Wednesday',
'community_ALBANY PARK', 'community_ARCHER HEIGHTS',
'community_ARMOUR SQUARE', 'community_ASHBURN',
'community_AUBURN GRESHAM', 'community_AUSTIN',
'community_AVALON PARK', 'community_AVONDALE',
'community_BELMONT CRAGIN', 'community_BEVERLY',
'community_BRIDGEPORT', 'community_BRIGHTON PARK',
'community_BURNSIDE', 'community_CALUMET HEIGHTS',
'community_CHATHAM', 'community_CHICAGO LAWN',
'community_CLEARING', 'community_DOUGLAS', 'community_DUNNING',
'community_EAST GARFIELD PARK', 'community_EAST SIDE',
'community_EDGEWATER', 'community_EDISON PARK',
'community_ENGLEWOOD', 'community_FOREST GLEN',
'community_FULLER PARK', 'community_GAGE PARK',
'community_GARFIELD RIDGE', 'community_GRAND BOULEVARD',
'community_GREATER GRAND CROSSING', 'community_HEGEWISCH',
'community_HERMOSA', 'community_HUMBOLDT PARK',
'community_HYDE PARK', 'community_IRVING PARK',
'community_JEFFERSON PARK', 'community_KENWOOD',
'community_LAKE VIEW', 'community_LINCOLN PARK',
'community_LINCOLN SQUARE', 'community_LOGAN SQUARE',
'community_LOOP', 'community_LOWER WEST SIDE',
'community_MCKINLEY PARK', 'community_MONTCLARE',
'community_MORGAN PARK', 'community_MOUNT GREENWOOD',
'community_NEAR NORTH SIDE', 'community_NEAR SOUTH SIDE',
'community_NEAR WEST SIDE', 'community_NEW CITY',
'community_NORTH CENTER', 'community_NORTH LAWNDALE',
'community_NORTH PARK', 'community_NORWOOD PARK',
'community_OAKLAND', 'community_OHARE', 'community_PORTAGE PARK',
'community_PULLMAN', 'community_RIVERDALE',
'community_ROGERS PARK', 'community_ROSELAND',
'community_SOUTH CHICAGO', 'community_SOUTH DEERING',
'community_SOUTH LAWNDALE', 'community_SOUTH SHORE',
'community_UPTOWN', 'community_WASHINGTON HEIGHTS',
'community_WASHINGTON PARK', 'community_WEST ELSDON',
'community_WEST ENGLEWOOD', 'community_WEST GARFIELD PARK',
'community_WEST LAWN', 'community_WEST PULLMAN',
'community_WEST RIDGE', 'community_WEST TOWN',
'community_WOODLAWN'], dtype=object)
features = num_cols + list(ohe_cols)
random_forest = pipe["randomforestregressor"]
# Create the dataframe with importances
importance = pd.DataFrame(
{"Feature": features, "Importance": random_forest.feature_importances_}
)
# Sort by importance and get the top 20, and sort in a descending order
importance = importance.sort_values("Importance", ascending=False).iloc[:20]
# Plot
importance.hvplot.barh(x="Feature", y="Importance", color="purple", alpha=0.5, edgecolor="purple", height=500, flip_yaxis=True)
WARNING:param.main: edgecolor option not found for barh plot with bokeh; similar options include: ['color', 'color', 'muted_color']
# Make the predictions
predictions = best_random.predict(test_set)
# MAE
# NOTE: we convert from log(sale_price) to sale_price using exp()
errors = (np.exp(predictions) - np.exp(y_test)) / np.exp(y_test) * 100
#extract out the test set data
crime_test = crime.loc[test_set.index]
#include the percent error for all of the sales in the test set
crime_test['percent_error'] = errors
crime_test['prediction'] = np.exp(predictions)
crime_test['abs_percent_error']=abs(crime_test['percent_error'])
crime_test
| community | month_count | month | dow | week_count | weapon_count | count | crime_index | weapon_rate | people_risk_index | domestic | arrest | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index | logDistDeposit | logDistAbbuilding | logDistSchool | logDistGrocery | logDistSubway | logDistRetail | logDistAlerts | logDistLandmark | percent_error | prediction | abs_percent_error | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | ROGERS PARK | 1649 | 10 | Monday | 2744 | 1309 | 18144 | 662.189781 | 0.072694 | 18.343207 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 | -34.095150 | 436.415181 | 34.095150 |
| 12 | ROGERS PARK | 1649 | 10 | Tuesday | 2547 | 1309 | 18144 | 662.189781 | 0.292057 | 16.979225 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 | -34.095150 | 436.415181 | 34.095150 |
| 14 | ROGERS PARK | 1384 | 11 | Friday | 2583 | 1309 | 18144 | 662.189781 | 0.082162 | 19.476170 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 | -34.095150 | 436.415181 | 34.095150 |
| 15 | ROGERS PARK | 1384 | 11 | Monday | 2744 | 1309 | 18144 | 662.189781 | 0.092411 | 20.955373 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 | -34.095150 | 436.415181 | 34.095150 |
| 17 | ROGERS PARK | 1384 | 11 | Sunday | 2549 | 1309 | 18144 | 662.189781 | 0.450138 | 17.155176 | 0.140267 | 0.171627 | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 | 3.188012 | 3.891309 | 2.598127 | 2.655374 | 3.154793 | 2.589933 | 4.108093 | 3.099921 | -34.095150 | 436.415181 | 34.095150 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6460 | EDGEWATER | 1183 | 8 | Wednesday | 1786 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 | -2.512832 | 436.415181 | 2.512832 |
| 6461 | EDGEWATER | 1081 | 9 | Friday | 1908 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 | -2.512832 | 436.415181 | 2.512832 |
| 6462 | EDGEWATER | 1081 | 9 | Monday | 1748 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 | -2.512832 | 436.415181 | 2.512832 |
| 6464 | EDGEWATER | 1081 | 9 | Sunday | 1591 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 | -2.512832 | 436.415181 | 2.512832 |
| 6465 | EDGEWATER | 1081 | 9 | Thursday | 1738 | 682 | 12266 | 447.664234 | 0.000000 | 0.000000 | 0.104598 | 0.159873 | 4.1 | 18.2 | 9.2 | 9.7 | 23.8 | 33385 | 19.0 | 2.708248 | 3.773897 | 2.771839 | 2.893563 | 3.087525 | 2.728233 | 4.023814 | 2.520576 | -2.512832 | 436.415181 | 2.512832 |
1941 rows × 30 columns
error_test = crime_test.groupby("community")["abs_percent_error"].mean().reset_index()
error_test = boundary.merge(error_test,on="community")
error_test = error_test.rename(columns={'area_numbe': 'community_area'}).drop(columns=['area', 'shape_area','perimeter','area_num_1','comarea_id','comarea','shape_len'])
error_test
| community | community_area | geometry | abs_percent_error | |
|---|---|---|---|---|
| 0 | DOUGLAS | 35 | MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ... | 46.250908 |
| 1 | OAKLAND | 36 | MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ... | 24.615683 |
| 2 | FULLER PARK | 37 | MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ... | 29.125332 |
| 3 | GRAND BOULEVARD | 38 | MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ... | 19.231095 |
| 4 | KENWOOD | 39 | MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ... | 70.387232 |
| ... | ... | ... | ... | ... |
| 72 | MOUNT GREENWOOD | 74 | MULTIPOLYGON (((-87.69646 41.70714, -87.69644 ... | 55.974961 |
| 73 | MORGAN PARK | 75 | MULTIPOLYGON (((-87.64215 41.68508, -87.64249 ... | 19.973673 |
| 74 | OHARE | 76 | MULTIPOLYGON (((-87.83658 41.98640, -87.83658 ... | 47.622471 |
| 75 | EDGEWATER | 77 | MULTIPOLYGON (((-87.65456 41.99817, -87.65456 ... | 2.512832 |
| 76 | EDISON PARK | 9 | MULTIPOLYGON (((-87.80676 42.00084, -87.80676 ... | 221.550299 |
77 rows × 4 columns
error_test_hex = h3fy(error_test, resolution=8)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
# Plot all features in the same figure
fig, ax = plt.subplots(ncols=1,nrows=1, figsize=(7,7))
# MAE
ax=ax
MAE_hex = area_interpolate(source_df=error_test, target_df=error_test_hex, intensive_variables=['abs_percent_error'])
MAE_hex = MAE_hex.to_crs(3857)
MAE_hex.plot('abs_percent_error', cmap='magma_r', alpha=0.6, linewidth=0.6, edgecolor="white", ax=ax, legend=True, legend_kwds=dict(loc='upper right'),scheme="Quantiles", k=10)
ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Distribution of MAE',fontsize=15)
plt.savefig('hexgrids_MAE')
/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/4013683622.py:6: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/area_weighted/area_interpolate.py:308: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:740: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. /Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.